Are you a ‘data digester’? Exploring California’s new AI training registration bill
Posted: March 11, 2024
California is considering a bill, AB-3204, that would require businesses that use personal information to train AI to register with the California Privacy Protection Agency (CPPA).
The training of AI systems, large language models (LLMs) in particular, often involves large amounts of personal information. Governments and regulators worldwide are trying to figure out how to protect people against the risks to their privacy presented by this process.
AB-3204 imposes no new restrictions or rules on the AI training process itself, but it would require many businesses to register and would impose steep fines on those that fail to do so when required.
What’s a ‘data digester’?
The bill defines a “data digester” as “a business that uses personal information to train artificial intelligence.”
Does this apply to any business that uses personal information to train AI? It would appear not.
AB-3204 states that it would “incorporate specified definitions” from existing California law, namely the California Consumer Privacy Act (CCPA), as amended by the California Privacy Rights Act (CPRA).
So it’s likely that only businesses that are covered by the CCPA, and that train AI on “personal information” as defined in the CCPA, would meet the “data digester” definition.
Relevant definitions from the CCPA
There are several terms used in AB-3204 that could derive their meaning from the CCPA.
Under the CCPA, a “business” is a legal entity doing business in California that met one or more of the following criteria in the preceding calendar year:
- It earned $25 million or more in annual revenues
- It bought, sold, or shared personal information about 100,000 or more consumers
- It made 25% or more of its annual revenues from selling consumers’ personal information.
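As a rough illustration, the three criteria above can be expressed as a simple either/or test. This is a minimal sketch based only on the thresholds as summarized in this article: the class and function names are hypothetical, the entity is assumed to be doing business in California, and none of the CCPA’s exemptions are modelled.

```python
from dataclasses import dataclass


@dataclass
class PriorYearFigures:
    """Hypothetical summary of a company's preceding calendar year."""
    annual_revenue_usd: float
    consumers_bought_sold_or_shared: int
    revenue_share_from_selling_pi: float  # fraction between 0.0 and 1.0


def meets_ccpa_business_threshold(figures: PriorYearFigures) -> bool:
    """Return True if at least one of the three criteria is met.

    Assumption: thresholds are taken from the article's summary, not
    from the statutory text, and exemptions are ignored.
    """
    return (
        figures.annual_revenue_usd >= 25_000_000
        or figures.consumers_bought_sold_or_shared >= 100_000
        or figures.revenue_share_from_selling_pi >= 0.25
    )
```

Meeting a single criterion is enough, so a small company that makes most of its revenue from selling personal information could qualify even with modest revenues and few consumers.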
The CCPA provides many exemptions, including for regulated businesses in the health, credit-rating, and finance industries, but the definition of “consumer” includes all California residents – including a company’s employees and commercial partners’ employees.
“Personal information” is also defined broadly and can include any “information that identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.”
AB-3204 also references “sensitive personal information”, which is defined in the CCPA as the following types of personal information:
- Social security, driver’s license, state identification card, or passport number
- Account log-in, financial account, debit card, or credit card number in combination with any required security or access code, password, or credentials allowing access to an account
- Precise geolocation
- Racial or ethnic origin, citizenship or immigration status, religious or philosophical beliefs, or union membership
- The contents of a consumer’s mail, email, and text messages (unless the business is the intended recipient)
- Genetic data
- Personal information collected and analyzed concerning a consumer’s health
- Personal information collected and analyzed concerning a consumer’s sex life or sexual orientation
Neither the CCPA nor AB-3204 defines “artificial intelligence” or “training”.
As such, the bill could capture relatively common activities, such as improving email spam filters or using AI features in products like Zoom or Microsoft Office.
What information must a data digester provide when registering?
Much like California’s data broker registration law, recently amended by the Delete Act, AB-3204 would require “data digesters” to provide the following information when registering with the CPPA:
- Its name and “primary” physical, email, and website addresses
- Each category of personal information it uses to train AI (the relevant categories of personal information are set out in the CCPA)
- Each category of sensitive personal information it uses to train AI
- Each category of information related to consumers’ receipt of sensitive services, which include health care services related to “mental or behavioral health, sexual and reproductive health, sexually transmitted infections, substance use disorder, gender-affirming care, and intimate partner violence,” as defined under California Civil Code § 56.05
- Whether it trains AI using minors’ personal information
- Whether and to what extent it or any of its subsidiaries is regulated by:
  - The federal Fair Credit Reporting Act
  - The federal Gramm-Leach-Bliley Act
  - The federal Driver’s Privacy Protection Act of 1994
  - The Insurance Information and Privacy Protection Act
  - The Confidentiality of Medical Information Act or the privacy, security, and breach notification rules issued by the US Department of Health and Human Services
  - The privacy of pupil records under Title 2 of the Education Code
- Any additional information or explanation the data digester chooses to provide concerning its artificial intelligence training practices.
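For teams that want to start gathering this information internally, the items listed above could be captured in a simple record. The sketch below is purely illustrative: the bill does not prescribe a format, and every class, field, and method name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataDigesterRegistration:
    """Hypothetical internal record mirroring the items listed above."""
    name: str
    primary_physical_address: str
    primary_email_address: str
    primary_website_address: str
    personal_information_categories: list[str] = field(default_factory=list)
    sensitive_personal_information_categories: list[str] = field(default_factory=list)
    sensitive_services_information_categories: list[str] = field(default_factory=list)
    trains_on_minors_personal_information: bool = False
    # Maps a statute name to a free-text note on the extent of regulation.
    regulatory_regimes: dict[str, str] = field(default_factory=dict)
    additional_information: str = ""

    def missing_contact_fields(self) -> list[str]:
        """Return the names of contact fields that are still blank."""
        contact = {
            "name": self.name,
            "primary_physical_address": self.primary_physical_address,
            "primary_email_address": self.primary_email_address,
            "primary_website_address": self.primary_website_address,
        }
        return [key for key, value in contact.items() if not value.strip()]
```

A record like this only helps with internal preparation; the CPPA would decide how registrations are actually submitted.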
What would happen if a data digester failed to register?
Under the bill, a data digester that fails to register within 90 days of the deadline would receive notice from the CPPA, which would also post the data digester’s name on a designated website.
AB-3204 proposes the following sanctions on data digesters that fail to register with the CPPA:
- A $200 fine for each day prior to its name being posted on the CPPA’s website
- A $5,000 fine for each day after its name is posted on the CPPA’s website, starting on the 15th day after posting.
- The fees that the data digester owes (covering the period over which it failed to register).
- Any expenses incurred by the CPPA.
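To get a feel for how quickly the daily fines could add up, here is a rough sketch of the arithmetic. It rests on assumptions that go beyond the bill as described here: the function name and parameters are hypothetical, the 15th day after posting is treated as the first day charged at $5,000, the first 14 days after posting are assumed to carry no daily fine, and unpaid fees and CPPA expenses are left out.

```python
PRE_POSTING_DAILY_FINE_USD = 200
POST_POSTING_DAILY_FINE_USD = 5_000
FIRST_CHARGEABLE_DAY_AFTER_POSTING = 15  # assumption: day 15 is the first day charged


def estimate_daily_fines(days_before_posting: int, days_after_posting: int) -> int:
    """Estimate daily fines only (excludes unpaid fees and CPPA expenses).

    days_before_posting: days unregistered before the name was posted.
    days_after_posting: days unregistered since the name was posted.
    """
    if days_before_posting < 0 or days_after_posting < 0:
        raise ValueError("day counts must be non-negative")
    pre_posting = PRE_POSTING_DAILY_FINE_USD * days_before_posting
    # Days 1 to 14 after posting are assumed to incur no daily fine.
    chargeable_days = max(
        0, days_after_posting - (FIRST_CHARGEABLE_DAY_AFTER_POSTING - 1)
    )
    return pre_posting + POST_POSTING_DAILY_FINE_USD * chargeable_days
```

Under these assumptions, `estimate_daily_fines(30, 20)` gives $36,000: 30 days at $200 plus six days (days 15 to 20) at $5,000.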
Note that the CPPA would also be permitted to make regulations setting out its interpretation of the law, as we’ve seen it make under the CCPA.